ABSTRACT

Motivation: Latent periodic elements in genomes play important
roles in genomic functions. Many complex periodic elements in
genomes are difficult to detect with commonly used digital signal
processing (DSP) methods. We present a novel method to compute the
periodic power spectrum of a DNA sequence based on the nucleotide
distributions at periodic positions of the sequence. The method
directly calculates the full periodic spectrum of a DNA sequence rather
than the frequency spectrum obtained by Fourier transform. The magnitude
of the periodic power spectrum reflects the strength of the periodicity
signals; thus, the algorithm can capture all the latent periodicities in
DNA sequences.

Results: We apply this method to the detection of latent periodicities in
different genome elements, including exons and microsatellite DNA
sequences. The results show that the method minimizes the impact
of spectral leakage, captures a much broader range of latent
periodicities in genomes, and outperforms the conventional Fourier
transform.

Availability: The source code for the algorithms described in this
paper can be accessed at MATLAB file central.

Contact: cyin1@uic.edu 


1 INTRODUCTION 

Regions of approximate tandem repeats in genomes are abundant in
many species from bacteria to mammals, and are essential for many
structures and functions of genomes (Shapiro and Sternberg, 2005).
For example, many researchers have shown that the 3-periodicity
of a DNA sequence indicates protein coding regions, and it is
used for protein coding region identification (Silverman and Linsker,
1986; Tiwari et al., 1997). Pervasive hidden 10-11 bp periodicities
in complete genomes reflect protein structure and DNA folding
(Herzel et al., 1999). In bacterial genomes, repetitive DNA
sequences form alternative DNA conformations and thus induce
genetic instability (Wojcik et al., 2012). Repetitive DNA sequence
elements are also fundamental to the cooperative molecular
interactions forming nucleoprotein complexes. In the human
genome, simple tandem DNA repeats may be associated with human
neurodegenerative diseases (Sutherland and Richards, 1995). Short
tandem repeats (STRs) have a wide range of applications,
including medical genetics, forensics, and genetic genealogy
(Gymrek et al., 2012). In addition, repeats in genomes often
cause problems in assembly and indel discovery in next-generation
sequencing (Treangen and Salzberg, 2011; Narzisi and Schatz, 2015).

Repetitive sequences and periodicities in genomes vary in size,
scale and complexity (Trifonov, 1998; Hauth and Joseph, 2002).
The size of repeats can range from two base pairs to hundreds of
base pairs. Repeats can be dispersed widely over a relatively long
sequence range, or can be simple tandem arrays of a simple sequence
composition. Some repeats contain partial periodic sequence
patterns, and some have hidden periodicity due to multiple periodic
components (Korotkov et al., 2003). Thus, latent repeat signals in
DNA sequences are difficult to analyze by straightforward
observation and comparison. Accurate identification of these repeats
and periodicities is one of the important problems in genome analysis.

The latent periodicities and repetitive features of DNA
sequences are primarily studied by digital signal processing (DSP)
approaches such as the Fourier transform (Silverman and Linsker,
1986; Sharma et al., 2004; Buchner and Janjarasjitt, 2003) and the
wavelet transform (Wang and Stein, 2010). The DNA sequence
is converted to numeric sequences, and the Fourier transform
is then applied for power spectrum analysis of the DNA sequence
(Afreixo et al., 2004). Other periodicity detection methods
include statistical methods (Epps et al., 2011), maximum likelihood
estimation (Arora and Sethares, 2007), information decomposition
(Korotkov et al., 2003), direct frequency mapping (Gluncic and Paar,
2013), and chaos game representation (CGR) (Messaoudi et al.,
2014). For a review of these methods, refer to Grover et al. (2012).


Despite many studies on hidden periodicities in DNA sequences,
identification of all the hidden periodicities in DNA sequences is
still a challenging problem, because of the complexity of repetitive
sequences and the large amount of background noise present in
Fourier analysis and other methods (Suvorova et al., 2014; Epps et
al., 2011; Illingworth et al., 2008).

In this paper, we present a novel algorithm for computing the periodic
power spectrum (PPS) based on the periodic distribution of nucleotides
in DNA sequences. We apply the algorithm to the identification
of hidden periodicities in DNA sequences. The results demonstrate
that the algorithm captures all the latent periodicities in DNA
sequences quantitatively, effectively and efficiently, and that it
outperforms the methods based on the Fourier transform.
2 METHODS AND ALGORITHMS

2.1 Numerical representations of DNA sequences 

DNA molecules consist of four linearly linked nucleotides, adenine
(A), thymine (T), cytosine (C), and guanine (G). A DNA sequence
can be represented as a string of the four characters A, T, C, and
G of arbitrary length. To apply DSP to DNA sequence analysis,
the character string of a DNA molecule must be mapped into one
or more numerical sequences (Yin, 2014). One of the methods in
the literature is to use binary indicator sequences (Voss, 1992). A DNA
sequence, denoted as $s(0), s(1), \ldots, s(N-1)$, can be decomposed
into four binary indicator sequences, $u_A(n)$, $u_T(n)$, $u_C(n)$, and
$u_G(n)$, which indicate the presence or absence of the four nucleotides
A, T, C, and G at the $n$-th position, respectively. The indicator
mapping of DNA sequences is defined as follows:

$$u_\alpha(n) = \begin{cases} 1, & s(n) = \alpha \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $\alpha \in \{A, C, G, T\}$ and $n = 0, 1, 2, \ldots, N-1$. The four
indicator sequences correspond to the appearance of the four
nucleotides at each position of the DNA sequence. For example, the
indicator sequence $u_A(n) = 0001010111\ldots$ indicates that the
nucleotide A is present at positions 4, 6, 8, 9, and 10 of the DNA
sequence (counting positions from 1). Table 1 illustrates an example
of the Voss 4D binary indicator mappings of a DNA sequence.

Table 1. The Voss 4D binary indicator mappings of a short DNA sequence.

DNA   T  A  G  C  C  T  G  C  T  G  A  T
u_A   0  1  0  0  0  0  0  0  0  0  1  0
u_T   1  0  0  0  0  1  0  0  1  0  0  1
u_C   0  0  0  1  1  0  0  1  0  0  0  0
u_G   0  0  1  0  0  0  1  0  0  1  0  0

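The indicator mapping of equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB code; the function name `voss_indicators` is hypothetical, and NumPy is assumed to be available. It reproduces the rows of Table 1 for the sequence TAGCCTGCTGAT.

```python
import numpy as np

def voss_indicators(seq):
    """Return a dict mapping each nucleotide to its binary indicator sequence."""
    s = np.array(list(seq.upper()))          # one character per array element
    # u_alpha(n) = 1 where s(n) == alpha, 0 otherwise (equation (1))
    return {a: (s == a).astype(int) for a in "ATCG"}

u = voss_indicators("TAGCCTGCTGAT")          # the sequence of Table 1
for a in "ATCG":
    print(a, "".join(map(str, u[a])))
```

Because every position holds exactly one nucleotide, the four indicator sequences always sum to one at each position.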
2.2 Fourier power spectrum

The discrete Fourier transform (DFT) transforms observation data in
the time domain into new values in the frequency domain.
It gives a unique representation of the original signal in the frequency
domain. DFT spectral analysis of DNA sequences may detect
latent or hidden periodic signals in the original sequences. It may
discover approximate repeats that are difficult to detect by
tandem repeat search. Let $X$ be the DFT of a real number time series
$x$ of length $N$; then $X(k)$ is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1 \tag{2}$$

where $i = \sqrt{-1}$. The frequency domain vector $X$ contains all the
information about the original signal $x$ and can recover the signal.
The DFT power spectrum of the signal $x(n)$ at the frequency $k$ is
defined as

$$PS(k) = X(k)\, X(k)^{*}, \quad k = 0, 1, \ldots, N-1 \tag{3}$$

where $X(k)^{*}$ denotes the complex conjugate of $X(k)$. Similarly,
for a DNA sequence, let $U_\alpha$ and $PS_\alpha$ be the Fourier transform
and the corresponding Fourier power spectrum of the binary indicator
sequence $u_\alpha$, $\alpha \in \{A, T, C, G\}$, of the DNA sequence. The
Fourier power spectrum of the DNA sequence is the sum of the Fourier
power spectra $PS_\alpha$:

$$PS(k) = \sum_{\alpha \in \{A,T,C,G\}} PS_\alpha(k) \tag{4}$$

The Fourier power spectrum of DNA sequences can be used
to identify periodicities in DNA sequences. For example, most
protein coding regions show a prominent peak at frequency
$k = N/3$ in the Fourier power spectrum because of the variance of
nucleotide distributions in the three codon positions (Yin and Yau,
2005). The Fourier power spectrum $PS_\alpha(k)$ is large when the
nucleotide $\alpha$ has a significant tendency to appear about every
$p = N/k$ positions. In particular, a large value at $k = N/3$ means that
$\alpha$ tends to appear at a certain codon position. Note that because
the Fourier power spectrum of a real number series is symmetric, the
Fourier power spectrum plots only show the first half of the spectrum.
Furthermore, because the DFT at frequency zero is simply the sum of
the data points and carries no periodicity information, the power
spectrum at frequency zero is not included in the plots.
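As an illustration of equations (2) to (4), the following sketch computes the Fourier power spectrum of a DNA sequence with NumPy's FFT. The function name `dna_fourier_power_spectrum` and the toy sequence are assumptions made for this example; they do not come from the paper.

```python
import numpy as np

def dna_fourier_power_spectrum(seq):
    """Sum of the DFT power spectra of the four Voss indicator sequences."""
    s = np.array(list(seq.upper()))
    ps = np.zeros(len(s))
    for a in "ATCG":
        X = np.fft.fft((s == a).astype(float))   # equation (2) for u_alpha
        ps += np.abs(X) ** 2                     # equations (3) and (4)
    return ps

seq = "ATG" * 10            # hypothetical toy sequence with exact 3-base periodicity
ps = dna_fourier_power_spectrum(seq)
N = len(seq)
half = ps[1 : N // 2 + 1]   # skip frequency zero, keep the first half only
k_peak = 1 + int(np.argmax(half))
print(k_peak, N // 3)       # the peak sits at frequency k = N/3
```

For this perfectly periodic toy input, the only nonzero value in the plotted half of the spectrum is at k = N/3.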

2.3 Periodic power spectrum 

The strength of a periodicity within a 1D real number signal is
determined by the distribution of real values on periodic positions.
Thus, the congruence derivative vector of a 1D real number
signal, which represents this periodic distribution, was introduced
by Wang et al. (2012). For DNA sequences, it has been shown
that the Fourier power spectrum at a periodicity p is determined by
the distribution of nucleotides on the periodic positions.
For example, the 3-base periodicity is determined by the non-uniform
distribution of nucleotides on the three codon positions (Yin and Yau,
2005). Here, we define the congruence derivative vectors that
reflect the nucleotide distributions on the periodic positions. For
example, when the DNA sequence AGTTAACGCCTAGCC is
mapped into the Voss 4D binary sequences $u_A, u_T, u_C, u_G$, the
congruence derivative vectors of periodicity 3 reflect the nucleotide
distributions on periodic-3 positions in the DNA sequence, and
have the following values: $f_A = [1,1,2]$, $f_T = [1,1,1]$,
$f_C = [2,1,2]$, $f_G = [1,2,0]$, respectively.

The congruence derivative vector of a 1D real number signal is
defined as follows.

DEFINITION 2.1. For a real number signal $x$ of length $N$, the
element of the congruence derivative vector $f_p$ of the signal $x$, for
periodicity $p$, is defined as

$$f_p(q) = \sum_{\mathrm{mod}(n,p) = q} x(n), \quad q = 0, 1, \ldots, p-1, \quad n = 0, 1, \ldots, N-1 \tag{5}$$

where mod(n, p) is the modulo operation and returns the remainder
after division of $n$ by $p$; then $f_p = (f_p(0), f_p(1), \ldots, f_p(p-1))$.
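Definition 2.1 amounts to folding the signal at period p and summing each residue class. A minimal sketch, assuming NumPy and a hypothetical helper name `congruence_derivative_vector`, reproduces the periodicity-3 vectors quoted above for AGTTAACGCCTAGCC:

```python
import numpy as np

def congruence_derivative_vector(x, p):
    """f_p(q) = sum of x(n) over all n with n mod p == q (Definition 2.1)."""
    x = np.asarray(x, dtype=float)
    residues = np.arange(len(x)) % p
    # bincount with weights adds x(n) into the bin given by n mod p
    return np.bincount(residues, weights=x, minlength=p)

seq = "AGTTAACGCCTAGCC"
s = np.array(list(seq))
f = {a: congruence_derivative_vector(s == a, 3) for a in "ATCG"}
for a in "ATCG":
    print(a, f[a])
```

Using `np.bincount` keeps the cost linear in the sequence length for each periodicity, which is one possible design choice; an explicit loop over n would give the same result.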

We may consider a frequency domain transform as a projection of the
1D real number signal $x$ onto periodic shift sequences. Let
$\omega_p = e^{-i 2\pi / p}$ be the $p$-th root of unity; we then have the
following periodic shift sequences as basis vectors:

$$\mathcal{S}_p^q(n) = \begin{cases} \omega_p^q, & \text{if } \mathrm{mod}(n - q, p) = 0 \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

where $q = 0, 1, \ldots, p-1$.

Table 2 is an example of the periodic shift sequences $\mathcal{S}_p^q$ used
as basis vectors for periodicity 4.


Table 2. Periodic shift sequences for periodicity 4

n                     0             1             2             3             4             5             6             7             ...
$\mathcal{S}_4^0(n)$  $\omega_4^0$  0             0             0             $\omega_4^0$  0             0             0             ...
$\mathcal{S}_4^1(n)$  0             $\omega_4^1$  0             0             0             $\omega_4^1$  0             0             ...
$\mathcal{S}_4^2(n)$  0             0             $\omega_4^2$  0             0             0             $\omega_4^2$  0             ...
$\mathcal{S}_4^3(n)$  0             0             0             $\omega_4^3$  0             0             0             $\omega_4^3$  ...


We propose here the periodic transform of a real number signal $x$
at periodicity $p$ as the projection of the signal onto the corresponding
periodic basis vectors $\mathcal{S}_p$, as in equation (7). The periodic
transform is defined as


DEFINITION 2.2.

$$XP(p) = \sum_{n=0}^{N-1} \sum_{q=0}^{p-1} x(n)\, \mathcal{S}_p^q(n), \quad p = 1, 2, \ldots, N$$


In particular, if the length of a signal is a multiple of a
periodicity $p$, we have the following theorem.


THEOREM 1. If $N$ is a multiple of $p$, the projection
of the signal $x$ onto the periodic basis sequences $\mathcal{S}_p$ is
equivalent to the Fourier transform of the congruence derivative
vector of the signal.

The proof of Theorem 1 is provided by Wang et al. (2012). Let
$\mathcal{S}_p = (\omega_p^0, \omega_p^1, \ldots, \omega_p^{p-1})$; the periodic
transform at periodicity $p$ may then be rewritten as

$$XP(p) = f_p\, \mathcal{S}_p^{T} \tag{7}$$

where $\mathcal{S}_p^{T}$ is the transpose of $\mathcal{S}_p$.
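Theorem 1 can be checked numerically. The sketch below is an illustration rather than a proof: it assumes the sign convention $\omega_p = e^{-i 2\pi/p}$ (matching NumPy's forward DFT), uses a hypothetical helper `periodic_transform`, and compares the periodic transform of a random signal with the DFT coefficient at frequency N/p.

```python
import numpy as np

def periodic_transform(x, p):
    """XP(p) = f_p . (w^0, w^1, ..., w^(p-1)) with w = exp(-2*pi*i/p), as in equation (7)."""
    x = np.asarray(x, dtype=float)
    f = np.bincount(np.arange(len(x)) % p, weights=x, minlength=p)
    omega = np.exp(-2j * np.pi / p)            # assumed sign convention
    return np.sum(f * omega ** np.arange(p))

rng = np.random.default_rng(0)
x = rng.random(60)                             # N = 60 is a multiple of p = 5
p = 5
xp = periodic_transform(x, p)
X = np.fft.fft(x)
print(abs(xp - X[len(x) // p]))                # close to zero when p divides N
```

When p divides N, the coefficient X(N/p) of the full signal coincides with the first non-constant DFT coefficient of the folded vector $f_p$, which is what the comparison exercises.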

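The power spectrum at periodicity $p$ can then be expressed as

$$PPS(p) = XP(p)\, XP(p)^{*} = f_p \left( \mathcal{S}_p^{T}\, \bar{\mathcal{S}}_p \right) f_p^{T} \tag{8}$$

where $f_p^{T}$ is the transpose of $f_p$, $\mathcal{S}_p^{T}$ is the transpose of
$\mathcal{S}_p$, and $\bar{\mathcal{S}}_p$ is the complex conjugate of
$\mathcal{S}_p$. The matrix $\mathcal{S}_p^{T} \bar{\mathcal{S}}_p$ depends only
on the periodicity $p$ and is a complex Hermitian matrix whose main
diagonal elements are all one. If we add the corresponding upper and
lower entries of the matrix, we get a lower-triangular real number
matrix, $S_p$. The matrix $S_p$ can be used to compute the periodic
power spectrum. The entries of the matrix $S_p$, for
$k, j = 1, \ldots, p$, are defined as:

$$S_p(k,j) = \begin{cases} 2\cos\frac{2\pi(k-1)}{p}\cos\frac{2\pi(j-1)}{p} + 2\sin\frac{2\pi(k-1)}{p}\sin\frac{2\pi(j-1)}{p}, & \text{if } k > j \\ 1, & \text{if } k = j \\ 0, & \text{if } k < j \end{cases} \tag{9}$$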


Based on the above theory, we propose the following algorithm
(Algorithm 1) to compute the power spectrum of a 1D real number
signal using the congruence derivative vector and the periodic shift
sequences $\mathcal{S}_p$. Because the power spectrum from the algorithm
corresponds to a specific periodicity, while a DFT power spectrum
corresponds to a specific frequency, we name the method the PPS
(Periodic Power Spectrum) algorithm.


Input: a real number signal x of length N, periodicity p
Output: PPS at periodicity p

Steps:

1. Generate a vector
   $C = [1, \cos(2\pi/p), \cos(4\pi/p), \ldots, \cos(2(p-1)\pi/p)]$.

2. Generate a vector
   $V = [0, \sin(2\pi/p), \sin(4\pi/p), \ldots, \sin(2(p-1)\pi/p)]$.

3. Obtain a matrix $U = C^{T}C + V^{T}V$, where $C^{T}$ and $V^{T}$ are
   the transposes of vectors C and V, respectively.

4. Construct the spectrum transform matrix $S_p$ of size p x p
   from the matrix U:

   if k > j then return $S_p(k,j) = U(k,j) + U(j,k)$
   if k = j then return $S_p(k,j) = 1$
   else return $S_p(k,j) = 0$

5. Compute the congruence derivative vector $f_p$ of the signal at
   periodicity p (Equation (5)).

6. Compute the periodic power spectrum PPS(p) from $S_p$ and $f_p$:

   $PPS(p) = f_p\, S_p\, f_p^{T}$

   where $f_p^{T}$ is the transpose of the congruence derivative vector $f_p$.

Algorithm 1: Algorithm for computing PPS at periodicity p of a
real number signal.
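The six steps of Algorithm 1 translate almost line by line into NumPy. The following is a sketch under the assumption that positions are folded with 0-based indices; the function names `spectrum_transform_matrix` and `pps` are hypothetical and are not taken from the authors' MATLAB code.

```python
import numpy as np

def spectrum_transform_matrix(p):
    """Steps 1-4: lower-triangular real matrix S_p with unit diagonal."""
    q = np.arange(p)
    C = np.cos(2 * np.pi * q / p)              # step 1
    V = np.sin(2 * np.pi * q / p)              # step 2
    U = np.outer(C, C) + np.outer(V, V)        # step 3: U = C^T C + V^T V
    S = np.tril(U + U.T, k=-1)                 # step 4: U(k,j) + U(j,k) for k > j
    np.fill_diagonal(S, 1.0)                   # S_p(k,k) = 1
    return S

def pps(x, p):
    """Steps 5-6: PPS(p) = f_p S_p f_p^T for a real signal x."""
    x = np.asarray(x, dtype=float)
    f = np.bincount(np.arange(len(x)) % p, weights=x, minlength=p)
    return float(f @ spectrum_transform_matrix(p) @ f)

x = np.sin(2 * np.pi * np.arange(60) / 4)      # toy signal with period 4
print(pps(x, 4), pps(x, 5))
```

When N is a multiple of p, the value returned by `pps` should agree with the DFT power spectrum at frequency N/p, which gives a convenient sanity check for an implementation.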


2.4 Algorithm for computing periodic power spectrum 
of a DNA sequence 

From the definition of the DFT power spectrum and Theorem 1, we
can obtain the power spectrum of a DNA sequence after converting the
sequence to 4D binary indicator sequences. For example, the power
spectrum describing the 3-base periodicity property of protein coding
regions in a DNA sequence can be obtained from the four congruence
derivative vectors of the DNA sequence, without performing a
Fourier transform (Yin and Yau, 2005). Here, using the PPS for a
1D real signal, we propose the following algorithm (Algorithm 2) to
compute the power spectrum at any periodicity in a DNA sequence from
the congruence derivative vectors $f_\alpha$, $\alpha \in \{A, T, C, G\}$, at
periodic positions. The PPS algorithm takes the DNA sequence and uses
the spectrum transform matrix $S_p$ to compute the power spectrum at
periodicity p from the congruence derivative vectors.

Input: DNA sequence of length N, periodicity p
Output: PPS at periodicity p

Steps:

1. Convert the DNA sequence into four binary indicator
   sequences $u_\alpha$, $\alpha \in \{A, T, C, G\}$.

2. Compute the congruence derivative vectors $f_\alpha$ of the sequences
   $u_\alpha$, $\alpha \in \{A, T, C, G\}$, at periodicity p.

3. Compute the spectrum transform matrix $S_p$ of size p (refer to
   Algorithm 1, steps 1-4).

4. Compute the power spectrum PPS(p) from the four congruence
   derivative vectors $f_\alpha$ of periodicity p:

   $PPS(p) = \sum_{\alpha \in \{A,T,C,G\}} f_\alpha\, S_p\, f_\alpha^{T}$

   where $f_\alpha^{T}$ is the transpose of the congruence derivative vector $f_\alpha$.

Algorithm 2: Algorithm for computing the PPS at periodicity p of
a DNA sequence.


After we obtain the spectrum transform matrix $S_p$ and the
congruence derivative vectors from steps 1-3 of Algorithm 2, let
$f_{\alpha q}$, $q = 0, 1, \ldots, p-1$, $\alpha \in \{A, T, C, G\}$, be the elements
of the congruence derivative vector $f_\alpha$ of periodicity p in a DNA
sequence, which are the occurrence frequencies of nucleotide $\alpha$.
We present Algorithm 2 to compute PPS(p) at periodicity p of the DNA
sequence. PPS(p) can also be computed as follows:

$$PPS(p) = \sum_{\alpha \in \{A,T,C,G\}} \left( \sum_{q=0}^{p-1} f_{\alpha q}^2 + \sum_{k=0,\, j=0,\, k>j}^{p-1} S_p(k,j)\, f_{\alpha j} f_{\alpha k} \right) \tag{10}$$

Here the indices k and j run from 0 to p-1, so that
$S_p(k,j) = 2\cos(2\pi(k-j)/p)$ for $k > j$.
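Algorithm 2 can be sketched as a single function that sums the quadratic form over the four nucleotides. The name `dna_pps` is hypothetical, and the matrix is built directly from the identity $S_p(k,j) = 2\cos(2\pi(k-j)/p)$ for $k > j$, which follows from equation (9).

```python
import numpy as np

def dna_pps(seq, p):
    """PPS(p) of a DNA sequence: sum over A, T, C, G of f_a S_p f_a^T."""
    s = np.array(list(seq.upper()))                        # step 1 (indicators built below)
    q = np.arange(p)
    diff = q[:, None] - q[None, :]
    # step 3: lower triangle 2*cos(2*pi*(k-j)/p), unit diagonal, zeros above
    S = np.tril(2 * np.cos(2 * np.pi * diff / p), k=-1) + np.eye(p)
    idx = np.arange(len(s)) % p
    total = 0.0
    for a in "ATCG":
        f = np.bincount(idx, weights=(s == a).astype(float), minlength=p)  # step 2
        total += f @ S @ f                                 # step 4
    return float(total)

print(dna_pps("AGTTAACGCCTAGCC", 3))
```

The example reuses the sequence from Section 2.3, so the congruence derivative vectors entering the sum are the ones listed there.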

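Because short periodicities often receive much attention in
genome studies, we provide formulas for calculating the PPS
spectrum at short periodicities. We construct the spectrum transform
matrices for the short periodicities, $S_2$, $S_3$, $S_4$, $S_5$, $S_6$, as
in Table 1 in the supplementary materials. We then have the following
formulas for the PPS spectrum at a few short periodicities.

$$PPS(2) = \sum_{\alpha \in \{A,T,C,G\}} \left( f_{\alpha 0}^2 + f_{\alpha 1}^2 - 2 f_{\alpha 0} f_{\alpha 1} \right) \tag{11}$$

$$PPS(3) = \sum_{\alpha \in \{A,T,C,G\}} \left( \sum_{q=0}^{2} f_{\alpha q}^2 - f_{\alpha 0} f_{\alpha 1} - f_{\alpha 0} f_{\alpha 2} - f_{\alpha 1} f_{\alpha 2} \right) \tag{12}$$

$$PPS(4) = \sum_{\alpha \in \{A,T,C,G\}} \left( \sum_{q=0}^{3} f_{\alpha q}^2 - 2 f_{\alpha 0} f_{\alpha 2} - 2 f_{\alpha 1} f_{\alpha 3} \right) \tag{13}$$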
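As a worked check of the periodicity-3 formula, the matrix $S_3$ can be written out from equation (9) using $2\cos(2\pi/3) = 2\cos(4\pi/3) = -1$. The numerical example uses the vector $f_G = [1, 2, 0]$ from Section 2.3; the layout below is only an illustration of the arithmetic.

```latex
S_3 =
\begin{pmatrix}
 1 & 0 & 0 \\
-1 & 1 & 0 \\
-1 & -1 & 1
\end{pmatrix},
\qquad
f\, S_3\, f^{T} = f_0^2 + f_1^2 + f_2^2 - f_0 f_1 - f_0 f_2 - f_1 f_2 .

% Example with f_G = [1, 2, 0]:
f_G\, S_3\, f_G^{T} = 1 + 4 + 0 - (1)(2) - (1)(0) - (2)(0) = 3 .
```

A uniform vector such as $[1, 1, 1]$ gives $3 - 3 = 0$, which matches the intuition that a nucleotide spread evenly over the three positions carries no 3-base periodicity.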

The signal-to-noise ratio (SNR) of a DNA sequence at a
periodicity p is defined as its PPS power spectrum at periodicity
p divided by the average DFT power spectrum. In our previous
study, it was found that the average DFT power spectrum
corresponds to the sequence length N (Yin and Yau, 2007). Thus the
SNR of the PPS spectrum of a DNA sequence at periodicity p is
the PPS power spectrum at periodicity p divided by the length of
the DNA sequence:

$$SNR(p) = \frac{PPS(p)}{N} \tag{14}$$

In this paper, we choose an SNR threshold of 1 to distinguish a true
periodicity from background noise.
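Equation (14) and the threshold rule can be sketched as follows. The helper names `pps_snr` and `significant_periodicities` are hypothetical; the PPS is computed here as the squared magnitude of the periodic transform, which equals the quadratic form of Algorithm 1, and the scan starts at p = 2, which is an assumption of this example.

```python
import numpy as np

def pps_snr(seq, p):
    """SNR(p) = PPS(p) / N for a DNA sequence (equation (14))."""
    s = np.array(list(seq.upper()))
    idx = np.arange(len(s)) % p
    omega = np.exp(-2j * np.pi * np.arange(p) / p)   # powers of the p-th root of unity
    total = 0.0
    for a in "ATCG":
        f = np.bincount(idx, weights=(s == a).astype(float), minlength=p)
        total += abs(np.sum(f * omega)) ** 2         # |XP(p)|^2 for this nucleotide
    return total / len(s)

def significant_periodicities(seq, max_p, threshold=1.0):
    """Periodicities in 2..max_p whose SNR exceeds the threshold."""
    return [p for p in range(2, max_p + 1) if pps_snr(seq, p) > threshold]

seq = "ATG" * 10                                     # hypothetical toy sequence
print(pps_snr(seq, 3), significant_periodicities(seq, 10))
```

The threshold of 1 follows the text; other thresholds can be passed through the `threshold` argument.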


3 RESULTS AND DISCUSSIONS 

3.1 PPS power spectrum analysis of a periodic signal 

To illustrate the effectiveness of the PPS algorithm in the
identification of hidden periodicities in signals, we applied the PPS
algorithm to the following periodic signal, which consists of sine and
cosine components with periodicities 20 and 50 and is corrupted by white
Gaussian noise:

$$x(n) = \sin\left(2\pi \frac{n}{20} + \frac{\pi}{4}\right) + \cos\left(2\pi \frac{n}{50} + \frac{\pi}{4}\right) + \text{noise}, \quad n = 1, 2, \ldots, 300$$

Figure 1(a) is the plot of the original periodic signal and shows that
the two periodicities 20 and 50 in the signal are hidden by random noise.
Figure 1(b) is the plot of the Fourier power spectrum vs frequency and
shows that the two periodicities can be clearly identified by the Fourier
transform. The Fourier power spectrum is plotted over the frequency
domain, so the peaks for periodicities 20 and 50 are
at frequency N/p, i.e., frequencies 15 and 6, respectively. Figure 1(c)
is the plot of the PPS spectrum vs periodicity. The two pronounced peaks
of the PPS spectrum in Figure 1(c) at positions 20 and 50 represent the
two periodicities hidden in the original signal. The strengths of
periodicities 20 and 50 in the Fourier power spectrum also equal
the strengths in the corresponding PPS spectrum.
Comparing the spectra of the Fourier transform and PPS, we can
see that the PPS spectrum can be used to identify periodicities directly,
without converting from the frequency domain. In addition, from
Figure 1(b) and (c), we can see that the PPS spectrum has a lower noise
background than the Fourier power spectrum.
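The experiment of this section can be approximated with a short script. This is a sketch under stated assumptions: the phase offsets of $\pi/4$, the unit noise standard deviation and the random seed are choices made for this example, and the helper names `pps` and `test_signal` are hypothetical.

```python
import numpy as np

def pps(x, p):
    """PPS(p) of a real signal, computed as |f_p . (w^0, ..., w^(p-1))|^2."""
    x = np.asarray(x, dtype=float)
    f = np.bincount(np.arange(len(x)) % p, weights=x, minlength=p)
    return abs(np.sum(f * np.exp(-2j * np.pi * np.arange(p) / p))) ** 2

def test_signal(noise_sd=1.0, seed=0):
    """Sine (period 20) plus cosine (period 50) plus white Gaussian noise, n = 1..300."""
    n = np.arange(1, 301)
    rng = np.random.default_rng(seed)
    clean = (np.sin(2 * np.pi * n / 20 + np.pi / 4)
             + np.cos(2 * np.pi * n / 50 + np.pi / 4))
    return clean + noise_sd * rng.standard_normal(n.size)

x = test_signal()
spectrum = {p: pps(x, p) for p in range(2, 101)}     # PPS vs periodicity
print(spectrum[20], spectrum[50], spectrum[30])
```

Because 300 is a multiple of both 20 and 50, the noise-free PPS values at these two periodicities coincide with the DFT power at frequencies 15 and 6, in line with the statement above that the two spectra have equal strengths at these peaks.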

3.2 PPS power spectrum analysis of DNA sequences 

We also studied the impact of spectral leakage on the Fourier
power spectrum and the PPS spectrum. Consider a periodic sinusoidal
signal $x(n) = \sin(2\pi f_i n T)$, where N is the number of samples, T is
the distance between two samples, and $n = 0, 1, 2, \ldots, N-1$. The
Fourier power spectrum of the signal has a peak proportional to the
amplitude at the frequency $f_i$. If the frequency $f_i$ of the signal is
not a multiple of $F = 1/(NT)$, a distorted DFT is obtained. This
phenomenon is called spectral leakage (Costa and Melucci, 2010;
Lyon, 2009). Because spectral leakage makes it difficult to recognize
the correct frequencies of the signal, it should be avoided
in digital signal processing. To avoid spectral leakage at a specific
periodicity, we can pad zeros to make the length of the DNA sequence
a multiple of the periodicity. For example, to compute an accurate
power spectrum at periodicity 3, the length of the DNA sequence
after zero padding should be a multiple of 3.
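The effect of zero padding can be illustrated on a toy repeat. The sequences below are hypothetical stand-ins and are not the N130P5 sequences of the supplementary materials; the helper names are also assumptions. The sketch compares the DFT power at the frequency nearest to N/p, with and without padding, against the PPS value.

```python
import numpy as np

def indicator(seq, a):
    return (np.array(list(seq)) == a).astype(float)

def dft_power_at_period(seq, p, pad=False):
    """DFT power (summed over A, T, C, G) at the frequency closest to N/p."""
    n = len(seq)
    if pad:
        n = -(-n // p) * p                     # round the length up to a multiple of p
    k = round(n / p)
    # np.fft.fft(..., n=n) appends zeros when n exceeds the sequence length
    return sum(abs(np.fft.fft(indicator(seq, a), n=n)[k]) ** 2 for a in "ATCG")

def pps(seq, p):
    idx = np.arange(len(seq)) % p
    w = np.exp(-2j * np.pi * np.arange(p) / p)
    return sum(abs(np.sum(np.bincount(idx, weights=indicator(seq, a), minlength=p) * w)) ** 2
               for a in "ATCG")

full = "ATCGA" * 6                             # 30 bp, length is a multiple of 5
mutant = full[:-2]                             # 28 bp, two 3' nucleotides deleted
for name, s in [("full", full), ("mutant", mutant)]:
    print(name, dft_power_at_period(s, 5), dft_power_at_period(s, 5, pad=True), pps(s, 5))
```

In this toy case the padded DFT value and the PPS value agree for the truncated sequence, while the unpadded DFT value is lower, which mirrors the leakage effect described in the text.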

The test DNA sequence in the spectral leakage analysis
is an artificial sequence, denoted as N130P5. The DNA
sequence contains 6 copies of the imperfect 5-base repeat ATCGA. A
deletion mutant of the DNA sequence, denoted as N130P5-D2, is
constructed by deleting the two nucleotides AA at the 3' end.
The sequences N130P5 and N130P5-D2 are provided in the
supplementary materials. The Fourier power spectrum and PPS
spectrum are shown in Figure 2 and Table 3. For N130P5, the Fourier
power spectrum and the PPS spectrum at periodicity 5 are the same, with
a peak value of 361.9837 (Figure 2(a)(c)). The results verify that the
PPS spectrum at periodicity 5 is the same as that from the DFT
method when the sequence length is a multiple of periodicity
5. Figure 2(a) compares the Fourier power spectra of
N130P5 and N130P5-D2 with zero padding. The result in Figure
2(a) shows that, with zero padding, the deletion mutant has a
spectrum similar to that of the original DNA sequence. However, if there
is no zero padding of the N130P5-D2 sequence, as shown in Figure
2(b), the Fourier power spectrum at periodicity 5 becomes 212.0118,
which differs greatly from the corresponding value with zero padding
(335.8034) and from the PPS spectrum of the same sequence (335.8034).
The reason for this difference is that the length of the deletion
mutant is not a multiple of 5, so the periodicity 5 is hidden in the
Fourier spectrum because of spectral leakage. In addition, the PPS
spectrum can be considered as zero padding for each periodicity, as
shown by equation (7). These results indicate that the Fourier spectrum
may fail to identify some latent periodicities due to spectral leakage,
but the PPS spectrum can detect the hidden periodicity regardless of
whether the length of the sequence is a multiple of the periodicity.
This is the main advantage of the PPS spectrum.


Fig. 1: Power spectrum analysis of a periodic sine and cosine signal.
(a) Periodic sine and cosine signal corrupted with white noise, (b)
DFT power spectrum of the signal, (c) PPS power spectrum of the
signal.


Table 3. Comparison of Fourier power spectrum and PPS spectrum on
simulated DNA sequences

Sequence                          DFT: PS(f)   PPS: PS(5)
N130P5                            361.9837     361.9837
N130P5-D2 (with padding zeros)    335.8034     335.8034
N130P5-D2 (no padding zeros)      212.0118     335.8034


The PPS spectrum method was assessed on the well-studied 3-periodicity
of exon sequences. Figure 3(a) is the Fourier power
spectrum of the Homo sapiens cytochrome oxidase subunit I (COI) gene
(GenBank ID: EU834863, 617 bp). Figure 3(b) is the PPS spectrum
and indicates a significant periodicity 3 and a weak periodicity 8 in
the gene. We then examined these two periodicities using the DNA walk
and sliding window approaches described in our previous study
(Yin and Yau, 2007). Figure 3(c) shows the PPS spectra of periodicities
3 and 8 using DNA walk. The strengths of the PPS spectrum for
the two periodic signals increase as the length of the DNA walk
increases. Figure 3(d) shows the PPS spectra of periodicities 3 and 8 in
different sliding windows. From the sliding window plots, we can
identify the approximate regions in the gene for these two periodic
signals. These results show that the weak periodicity 8 can be
identified only in the PPS spectrum, not in the Fourier power spectrum.
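A sliding window scan of the kind used for Figure 3(d) can be sketched as below. The window size of 60 bp follows the figure caption, while the step size, the toy sequence and the function names are assumptions of this example; the DNA walk variant is not shown.

```python
import numpy as np

def pps_snr(seq, p):
    """SNR(p) = PPS(p) / N for a DNA (sub)sequence."""
    s = np.array(list(seq.upper()))
    idx = np.arange(len(s)) % p
    w = np.exp(-2j * np.pi * np.arange(p) / p)
    total = 0.0
    for a in "ATCG":
        f = np.bincount(idx, weights=(s == a).astype(float), minlength=p)
        total += abs(np.sum(f * w)) ** 2
    return total / len(s)

def sliding_window_snr(seq, p, window=60, step=10):
    """List of (window start, SNR at periodicity p) along the sequence."""
    starts = range(0, len(seq) - window + 1, step)
    return [(i, pps_snr(seq[i:i + window], p)) for i in starts]

seq = "ATG" * 40 + "A" * 120        # hypothetical: periodic region followed by a homopolymer
profile = sliding_window_snr(seq, 3)
for start, snr in profile:
    print(start, round(snr, 3))
```

Windows that lie inside the periodic region give a high SNR at periodicity 3, and windows inside the homopolymer give an SNR near zero, so the profile localizes the periodic signal.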

We performed other tests on DNA sequences of different lengths.
The computational results confirm that the PPS spectrum and the Fourier
power spectrum are the same for periodicity p only when the length
of the DNA sequence is a multiple of p. If the length of the DNA
sequence is not a multiple of the periodicity p, the Fourier power
spectrum is not equivalent to the PPS spectrum. This indicates that
a periodic signal may be hidden by other periodicities in Fourier
power spectrum analysis. This study shows that the Fourier power
spectrum of a DNA sequence is affected by the length of the
sequence, whereas the PPS spectrum overcomes this length problem.

3.3 Periodicities in microsatellite sequences 

We evaluated the effectiveness of the PPS method by comparing
the Fourier and PPS spectra of typical microsatellite DNA sequences.
The most studied groups of tandem repeats in genomes are
microsatellites (patterns of 10 bp) and minisatellites (patterns of 100
bp) because of their use as genetic markers in forensics, parentage
assessment, positional cloning, and population and evolutionary
genetics. Microsatellites are abundant and distributed all over
eukaryotic genomes with variable frequency. The period detection
method using PPS was assessed on the human microsatellite repeat
KLK1 AC DNA (GenBank Accession: M65145, 1072 bp). Figure
4(a) shows the Fourier power spectrum of the DNA sequence.
Although the Fourier power spectrum shows two high
peaks corresponding to approximate periodicities 11 and 12, it is
difficult to identify other exact periodicities from the Fourier power
spectrum. Figure 4(b) is the PPS spectrum SNR at periodicities 1-100.
It is clear that the DNA sequence contains a few periodicity
peaks with large spectrum SNR values. The significant SNR values
for short periodicities are: P7: 1.3873, P8: 1.2002, P11: 1.5389, and
P12: 2.4196. Our method can thus precisely identify 4 periodicities
(repeats): 7, 8, 11, and 12. This sequence has been studied with other
repeat finding methods (Sharma et al., 2004; Gupta et al., 2007),
but those methods can only detect 2- or 11-mer repeats and were
unable to detect the four periodicities 7, 8, 11, and 12. We can
locate the positions of the periodicities in the DNA sequence using
sliding window PPS. Figure 4(c) is the sliding window PPS of
periodicities 7, 11 and 12. The result shows the different locations
of these periodic signals on the DNA sequence. The 11-mer
repeats are located between 92 and 781 bp. When comparing
the sliding window PPS SNR plots of periodicities 11 and 12
(Figure 4(b),(c)), we can see that the peak regions of periodicities 11
and 12 are not the same: the region centered at 200
bp shows only periodicity 11, not periodicity 12. This result
suggests that periodicity 11 is not simply derived from periodicity 12.
The locations of periodicity 11 are in agreement with previous studies
(Sharma et al., 2004; Gupta et al., 2007). These results clearly show
the advantage of our algorithm in the identification of repeats. The PPS
method can detect more latent periodicities and hidden periodic
patterns and precisely locate these periodicities.


Fig. 2: Power spectrum analysis of DNA sequences. (a) Fourier
power spectra of DNA sequences N130P5 and N130P5-D2, (b)
Fourier power spectrum of N130P5-D2, (c) PPS spectra of
N130P5 and N130P5-D2.


Fig. 3: Power spectra of the Homo sapiens cytochrome oxidase subunit
I (COI) gene. (a) Fourier power spectrum, (b) PPS spectrum,
(c) PPS spectra of DNA walk, (d) sliding window PPS spectrum,
window size = 60 bp.

The second example is periodicity detection in human
microsatellite repeats (GenBank locus: HSVDJSAT, 1985 bp) using
the PPS method. This DNA sequence contains variable length
tandem repeats (VLTRs) (Hauth and Joseph, 2002). Figure 5(a) is
the Fourier power spectrum of the DNA sequence.
It shows the complexity and high noise level of the Fourier power
spectrum, which make it difficult to detect periodicities from the
Fourier power spectrum. From the PPS spectrum in Figure
5(b), we can accurately detect the short periodicities 4, 6, 8,
10, 22, 49, and 50. The corresponding SNR values of these
periodicities are: P4: 1.2781, P6: 1.4620, P8: 1.0349, P10: 1.7631,
P22: 1.0781, P49: 1.1268, and P50: 1.1023. The strongest periodicity in
this DNA sequence is periodicity 10, which is in agreement with
methods in the literature (Hauth and Joseph, 2002; Gupta et al., 2007).
From the sliding window PPS spectra of periodicities 4, 6 and 10
in Figure 5(c), we can locate the positions of these periodicities.
For example, a major location of periodicity 4 is between 700 and
900 bp; periodicity 6 is mainly located around 400 bp and
1000 bp; and periodicity 10 is mainly located at 1180 bp and 1400 bp.
The previous studies (Hauth and Joseph, 2002; Gupta et al., 2007)
can identify similar periodicities, but those methods need different
threshold settings and sometimes give conflicting results.

These examples demonstrate that the PPS spectrum can detect a
broad range of latent and complex periodicities in DNA sequences
that are missed by current state-of-the-art methods.





Fig. 4: Power spectra of DNA sequence M65145. (a) Fourier power 
spectrum, (b) PPS spectrum SNR, (c) sliding window PPS spectrum 
SNRs of different periodicities, window size = 60 bp. 
Fig. 5: Power spectra of DNA sequence HSVDJSAT. (a) Fourier
power spectrum, (b) PPS spectrum SNR, (c) sliding window PPS 
spectrum SNRs of different periodicities, window size = 100 bp. 


3.4 Computational complexity 

One of the challenging problems when applying the Fourier transform
to sequence studies is the computational complexity. The Fourier
transform attempts to explain a data set as a weighted sum of
sinusoidal basis elements, which works well when the data can be
closely approximated by such elements. The PPS method, in contrast,
directly searches for periodicities, repetitions, or regularities in
the data. The Fourier transform is also a computationally intensive
technique: even the FFT needs O(N log N) computational time. Because
the PPS algorithm only computes the spectrum for periodicities of
special interest, there is no need to compute the Fourier transform
for all frequencies, which can significantly reduce the computational
complexity. In addition, since the spectrum transform matrix $S_p$ is a
real number matrix, the computation of PPS is performed only on real
numbers, not complex numbers, which reduces the computation time of
the PPS algorithm.

3.5 Discussion 

From this study, the major advantage of the PPS spectrum is that it
can capture all the tandem periodicities because it reduces the high
noise level and spectral leakage found in the Fourier spectrum.
It is worth noting that the power spectrum method in this paper can
also be applied to the detection of periodicities in other series such
as protein sequences and biomedical signals. An open question is
why the PPS spectrum beyond a specific periodicity becomes smooth
and lacks characteristic peaks (Figure 1(c), 2(c), 3(b)). When we
examined the value of this specific periodicity, we found that the
position approximately equals %/2n. We will investigate this problem in
future studies.


4 CONCLUSIONS 

Periodic latent regions in genomes often play important roles in
many genomic functions. Carefully understanding the source and
pattern of periodicity in genomic sequences thus represents an
important problem in biology.

In this paper, we propose a novel power spectrum based on the
periodic distribution of symbols. We also propose an algorithm to
compute the PPS spectrum. The algorithm can identify and locate exact
and inexact periodic patterns in DNA sequences. The algorithm has
been assessed for the identification of periodicities in different
genomic elements, including exons and microsatellite DNA sequences.
The experimental results demonstrate the effectiveness
of our algorithm in identifying periodicities and minimizing
spectral leakage. The results also demonstrate the utility of analyzing
genomes in the periodicity space, and our method can serve as a
deterministic method for periodicity identification.


REFERENCES 

Afreixo, V., Ferreira, P. J., Santos, D., 2004. Fourier analysis of symbolic data: A brief 
review. Digital Signal Processing 14 (6), 523-530. 

Arora, R., Sethares, W. A., 2007. Detection of periodicities in gene sequences: a 
maximum likelihood approach. In: Genomic Signal Processing and Statistics, 2007. 
GENSIPS 2007. IEEE International Workshop on. IEEE, pp. 1-4. 

Buchner, M., Janjarasjitt, S., 2003. Detection and visualization of tandem repeats in 
DNA sequences. Signal Processing, IEEE Transactions on 51 (9), 2280-2287. 

Costa, A., Melucci, M., 2010. An information retrieval model based on discrete fourier 
transform. In: Advances in Multidisciplinary Retrieval. Springer, pp. 84-99. 



Epps, J., Ying, H., Huttley, G. A., 2011. Statistical methods for detecting periodic 
fragments in DNA sequence data. Biology direct 6 (21), 1-16. 

Gluncic, M., Paar, V., 2013. Direct mapping of symbolic DNA sequence into frequency 
domain in global repeat map algorithm. Nucleic acids research 41 (1), e17-e17. 

Grover, A., Aishwarya, V., Sharma, P., 2012. Searching microsatellites in DNA 
sequences: approaches used and tools developed. Physiology and Molecular Biology 
of Plants 18 (1), 11-19. 

Gupta, R., Sarthi, D., Mittal, A., Singh, K., 2007. A novel signal processing measure 
to identify exact and inexact tandem repeat patterns in DNA sequences. EURASIP 
Journal on Bioinformatics and Systems Biology 2007, 3-3. 

Gymrek, M., Golan, D., Rosset, S., Erlich, Y., 2012. lobSTR: a short tandem repeat 
profiler for personal genomes. Genome research 22 (6), 1154-1162. 

Hauth, A. M., Joseph, D. A., 2002. Beyond tandem repeats: complex pattern structures 
and distant regions of similarity. Bioinformatics 18 (suppl 1), S31-S37. 

Herzel, H., Weiss, O., Trifonov, E. N., 1999. 10-11 bp periodicities in complete 
genomes reflect protein structure and DNA folding. Bioinformatics 15 (3), 187-193. 

Illingworth, C. J., Parkes, K. E., Snell, C. R., Mullineaux, P. M., Reynolds, C. A., 2008. 
Criteria for confirming sequence periodicity identified by fourier transform analysis: 
application to GCR2, a candidate plant GPCR? Biophysical chemistry 133 (1), 28- 
35. 

Korotkov, E. V., Korotkova, M. A., Kudryashov, N. A., 2003. Information 
decomposition method to analyze symbolical sequences. Physics Letters A 312 (3), 
198-210. 

Lyon, D. A., 2009. The discrete fourier transform, part 4: spectral leakage. Journal of 
object technology 8 (7). 

Messaoudi, I., Oueslati, A., Lachiri, Z., 2014. Building specific signals from frequency 
chaos game and revealing periodicities using a smoothed fourier analysis. 

Narzisi, G., Schatz, M., 2015. The challenge of small-scale repeats for indel discovery. 
Frontiers in Bioengineering and Biotechnology 3 (8). 

Shapiro, J. A., Sternberg, R., 2005. Why repetitive DNA is essential to genome 
function. Biological Reviews 80 (2), 227-250. 

Sharma, D., Issac, B., Raghava, G., Ramaswamy, R., 2004. Spectral repeat finder (srf): 
identification of repetitive sequences using fourier transformation. Bioinformatics 
20 (9), 1405-1412. 

Silverman, B., Linsker, R., 1986. A measure of DNA periodicity. Journal of theoretical 
biology 118 (3), 295-300. 

Sutherland, G. R., Richards, R. I., 1995. Simple tandem DNA repeats and human 
genetic disease. Proceedings of the National Academy of Sciences 92 (9), 3636- 
3641. 

Suvorova, Y. M., Korotkova, M. A., Korotkov, E. V., 2014. Comparative analysis 
of periodicity search methods in DNA sequences. Computational biology and 
chemistry 53, 43-48. 

Tiwari, S., Ramachandran, S., Bhattacharya, A., Bhattacharya, S., Ramaswamy, 
R., 1997. Prediction of probable genes by fourier analysis of genomic sequences. 
Bioinformatics 13 (3), 263-270. 

Treangen, T. J., Salzberg, S. L., 2011. Repetitive DNA and next-generation sequencing: 
computational challenges and solutions. Nature Reviews Genetics 13 (1), 36-46. 

Trifonov, E. N., 1998. 3-, 10.5-, 200-and 400-base periodicities in genome sequences. 
Physica A: Statistical Mechanics and its Applications 249 (1), 511-516. 

Voss, R., 1992. Evolution of long-range fractal correlation and 1/f noise in dna base 
sequences. Physical Review Letters 68, 3805-3808. 

Wang, J., Liu, G., Zhao, J., 2012. Some features of Fourier spectrum for symbolic 
sequences. Numerical Mathematics A Journal of Chinese Universities (4), 341-356. 

Wang, L., Stein, L. D., 2010. Localizing triplet periodicity in DNA and cDNA 
sequences. BMC bioinformatics 11 (1), 550. 

Wojcik, E. A., Brzostek, A., Bacolla, A., Mackiewicz, P., Vasquez, K. M., 
Korycka-Machala, M., Jaworski, A., Dziadek, J., 2012. Direct and inverted repeats elicit 
genetic instability by both exploiting and eluding DNA double-strand break repair 
systems in mycobacteria. PloS one 7 (12), e51064. 

Yin, C., 2014. Representation of DNA sequences in genetic codon context 
with applications in exon and intron prediction. Journal of bioinformatics and 
computational biology. 


Yin, C., Yau, S. S.-T., 2005. A Fourier characteristic of coding sequences: origins and a 
non-Fourier approximation. Journal of Computational Biology 12 (9), 1153-1165. 

Yin, C., Yau, S. S.-T., 2007. Prediction of protein coding regions by the 3-base 
periodicity analysis of a DNA sequence. Journal of Theoretical biology 247 (4), 
687-694. 

Table 1. Spectrum transform matrices of some short periodicities 

Periodicity    Spectrum transform matrix S_p 

P_2             1  0 
               -2  1 

P_3             1  0  0 
               -1  1  0 
               -1 -1  1 

P_4             1  0  0  0 
                0  1  0  0 
               -2  0  1  0 
                0 -2  0  1 

P_6             1  0  0  0  0  0 
                1  1  0  0  0  0 
               -1  1  1  0  0  0 
               -2 -1  1  1  0  0 
               -1 -2 -1  1  1  0 
                1 -1 -2 -1  1  1 
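
The entries of Table 1 are consistent with the following closed form, offered here as a reading of the table rather than as the derivation given in the paper. The symbol f_j is introduced for illustration and denotes the number of occurrences of one nucleotide at positions congruent to j modulo p. Expanding the squared magnitude of the corresponding Fourier sum gives a real quadratic form whose coefficient matrix is lower triangular:

```latex
% f_j: number of occurrences of one nucleotide at positions n \equiv j \pmod{p}
\left|\sum_{j=0}^{p-1} f_j\, e^{-2\pi i j/p}\right|^2
  = \sum_{j=0}^{p-1} f_j^2
  + 2\sum_{j>k} f_j f_k \cos\frac{2\pi (j-k)}{p}
  = \mathbf{f}^{\mathsf T} S_p\, \mathbf{f},
\qquad
(S_p)_{jk} =
\begin{cases}
1, & j = k,\\[2pt]
2\cos\dfrac{2\pi (j-k)}{p}, & j > k,\\[2pt]
0, & j < k.
\end{cases}
```

For p = 2 the single off-diagonal entry is 2cos(pi) = -2, for p = 3 both off-diagonal values are 2cos(2pi/3) = -1, and for p = 4 the values 0 and -2 alternate, in agreement with the tabulated matrices.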

SUPPLEMENTARY MATERIALS 

The spectrum transform matrices (Table 1) and DNA 
sequences 

N130P5=CCATATCCGATCGGCAGCGCGTGCC TTTTATCGCTATCGATCGAATf 
CCGCGGCTGTCTATAGAAAAATTATAAATGATATTGATCCGAGTAGGGTCCC 
GCACTTCAA 

N130P5-D2=CCATATCCGATCGGCAGCGCG TGCCTTTTATCGCTATCGATCC 
GAGGACCGCGGCTGTCTATAGAAAAATTATAAATGATATTGATCCGAGTAGC 
GTGCGGGGCACTTC 